The intent of this report is to better understand trends related to the COVID-19 pandemic on the national, international, and global scale. This will be done by analyzing geographic and temporal data for the number of cases and deaths of the COVID-19 pandemic.
The data used for this report is courtesy of John Hopkins University COVID-19. The datasets can be found here.
The first two datasets used include a daily count of the cases and deaths, respectively, for every country. The last two datasets include a daily count of the cases and deaths, respectively, for each state of the United States (US). The data ranges between January 2020 and March 2023. The global data is subdivided by province or state, whereas the US data is subdivided by state and county. All data frames contain information about the latitude and longitude corresponding to a given location.
The data can be read in through the URLs below. Additionally, the packages that will be used for this report have been included below for reference. Note: Please install these packages, if required.
#load in packages
library(tidyverse)
library(ggplot2)
library(lubridate)
library(countrycode)
library(plotly)
#read in files
url_in <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
file_names <-
c(
"time_series_covid19_confirmed_global.csv",
"time_series_covid19_deaths_global.csv",
"time_series_covid19_confirmed_US.csv",
"time_series_covid19_deaths_US.csv"
)
#string concatenate file names
urls <- str_c(url_in, file_names)
#read in all files using urls
global_cases <- read_csv(urls[1])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_deaths <- read_csv(urls[2])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
US_cases <- read_csv(urls[3])
## Rows: 3342 Columns: 1154
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1148): UID, code3, FIPS, Lat, Long_, 1/22/20, 1/23/20, 1/24/20, 1/25/20...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
US_deaths <- read_csv(urls[4])
## Rows: 3342 Columns: 1155
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1149): UID, code3, FIPS, Lat, Long_, Population, 1/22/20, 1/23/20, 1/24...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#show summary statistics and quick overview of each frame.
head(global_cases)
## # A tibble: 6 × 1,147
## `Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> Afghanistan 33.9 67.7 0 0 0
## 2 <NA> Albania 41.2 20.2 0 0 0
## 3 <NA> Algeria 28.0 1.66 0 0 0
## 4 <NA> Andorra 42.5 1.52 0 0 0
## 5 <NA> Angola -11.2 17.9 0 0 0
## 6 <NA> Antarctica -71.9 23.3 0 0 0
## # … with 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>,
## # `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>,
## # `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>,
## # `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>,
## # `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>,
## # `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>,
## # `2/16/20` <dbl>, `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, …
head(global_deaths)
## # A tibble: 6 × 1,147
## `Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> Afghanistan 33.9 67.7 0 0 0
## 2 <NA> Albania 41.2 20.2 0 0 0
## 3 <NA> Algeria 28.0 1.66 0 0 0
## 4 <NA> Andorra 42.5 1.52 0 0 0
## 5 <NA> Angola -11.2 17.9 0 0 0
## 6 <NA> Antarctica -71.9 23.3 0 0 0
## # … with 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>,
## # `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>,
## # `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>,
## # `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>,
## # `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>,
## # `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>,
## # `2/16/20` <dbl>, `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, …
head(US_cases)
## # A tibble: 6 × 1,154
## UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 84001001 US USA 840 1001 Autauga Alabama US 32.5
## 2 84001003 US USA 840 1003 Baldwin Alabama US 30.7
## 3 84001005 US USA 840 1005 Barbour Alabama US 31.9
## 4 84001007 US USA 840 1007 Bibb Alabama US 33.0
## 5 84001009 US USA 840 1009 Blount Alabama US 34.0
## 6 84001011 US USA 840 1011 Bullock Alabama US 32.1
## # … with 1,145 more variables: Long_ <dbl>, Combined_Key <chr>,
## # `1/22/20` <dbl>, `1/23/20` <dbl>, `1/24/20` <dbl>, `1/25/20` <dbl>,
## # `1/26/20` <dbl>, `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>,
## # `1/30/20` <dbl>, `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>,
## # `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>,
## # `2/7/20` <dbl>, `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>,
## # `2/11/20` <dbl>, `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, …
head(US_deaths)
## # A tibble: 6 × 1,155
## UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 84001001 US USA 840 1001 Autauga Alabama US 32.5
## 2 84001003 US USA 840 1003 Baldwin Alabama US 30.7
## 3 84001005 US USA 840 1005 Barbour Alabama US 31.9
## 4 84001007 US USA 840 1007 Bibb Alabama US 33.0
## 5 84001009 US USA 840 1009 Blount Alabama US 34.0
## 6 84001011 US USA 840 1011 Bullock Alabama US 32.1
## # … with 1,146 more variables: Long_ <dbl>, Combined_Key <chr>,
## # Population <dbl>, `1/22/20` <dbl>, `1/23/20` <dbl>, `1/24/20` <dbl>,
## # `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>, `1/28/20` <dbl>,
## # `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>, `2/1/20` <dbl>,
## # `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>,
## # `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>, `2/9/20` <dbl>,
## # `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>, `2/13/20` <dbl>, …
A summary of each dataset has been provided through the included console messages. However, for the purposes of further analyses, data for cases and deaths will be combined (for global and US data individually).
Further data wrangling and transformation will be done for each section in this report. However, an initial tidying step has been included below to show the general structure that will be utilized for the analysis.
First, the global data will be tidied to provide an preliminary data frame. Population data, also provided by John Hopkins University, will be incorporated into the data frames.
#Tidy data - global
global_cases <- global_cases %>%
pivot_longer(
cols = -c(`Province/State`,
`Country/Region`,
Lat,
Long),
#dates in the columns, which serve as names
#cases are the data in the dates columns, which are values
names_to = "date",
values_to = "cases"
) %>%
#clean up lat and long columns since they aren't needed
select(-c(Lat, Long))
#tidy up global deaths
global_deaths <- global_deaths %>%
pivot_longer(cols = -c(`Province/State`,
`Country/Region`,
Lat,Long),
names_to = "date",
values_to = "deaths") %>%
select(-c(Lat, Long))
#join cases and death data
global <- global_cases %>%
full_join(global_deaths) %>%
rename(Country_Region = `Country/Region`,
Province_State = `Province/State`) %>%
#change date to date object
mutate(date = mdy(date))
#missing population data for global data set.
#need a combined key
global <- global %>%
#create combined key using unite
unite("Combined_Key",
c(Province_State, Country_Region),
#comma separator
sep = ", ",
na.rm = TRUE,
remove = FALSE
)
#need population data.
uid_lookup_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
uid <- read_csv(uid_lookup_url) %>%
select(-c(Lat, Long_, Combined_Key, code3, iso2, iso3, Admin2))
#join by population
global <- global %>%
left_join(uid, by = c("Province_State", "Country_Region")) %>%
select(-c(UID, FIPS)) %>%
select(Province_State, Country_Region, date,
cases, deaths, Population,
Combined_Key)
summary(global)
## Province_State Country_Region date cases
## Length:330327 Length:330327 Min. :2020-01-22 Min. : 0
## Class :character Class :character 1st Qu.:2020-11-02 1st Qu.: 680
## Mode :character Mode :character Median :2021-08-15 Median : 14429
## Mean :2021-08-15 Mean : 959384
## 3rd Qu.:2022-05-28 3rd Qu.: 228517
## Max. :2023-03-09 Max. :103802702
##
## deaths Population Combined_Key
## Min. : 0 Min. :6.700e+01 Length:330327
## 1st Qu.: 3 1st Qu.:5.790e+05 Class :character
## Median : 150 Median :6.574e+06 Mode :character
## Mean : 13380 Mean :2.769e+07
## 3rd Qu.: 3032 3rd Qu.:2.642e+07
## Max. :1123836 Max. :1.380e+09
## NA's :10287
The US data can also be tidied in a similar way.
US_cases <- US_cases %>%
#use combined key to retain cols
pivot_longer(cols = -(UID:Combined_Key),
names_to = "date",
values_to = "cases") %>%
#clean up some more
#county name select
select(Admin2:cases) %>%
mutate(date = mdy(date)) %>%
select(-c(Lat, Long_))
#do same for US deaths
US_deaths <- US_deaths %>%
pivot_longer(cols = -(UID:Population),
names_to = "date",
values_to = "deaths") %>%
#clean up some more
#county name select
select(Admin2:deaths) %>%
mutate(date = mdy(date)) %>%
select(-c(Lat, Long_))
#join data
US <- US_cases %>%
full_join(US_deaths)
summary(US)
## Admin2 Province_State Country_Region Combined_Key
## Length:3819906 Length:3819906 Length:3819906 Length:3819906
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## date cases Population deaths
## Min. :2020-01-22 Min. : -3073 Min. : 0 Min. : -82.0
## 1st Qu.:2020-11-02 1st Qu.: 330 1st Qu.: 9917 1st Qu.: 4.0
## Median :2021-08-15 Median : 2272 Median : 24892 Median : 37.0
## Mean :2021-08-15 Mean : 14088 Mean : 99604 Mean : 186.9
## 3rd Qu.:2022-05-28 3rd Qu.: 8159 3rd Qu.: 64979 3rd Qu.: 122.0
## Max. :2023-03-09 Max. :3710586 Max. :10039107 Max. :35545.0
Note: Please click on the tabs at the top of this report to switch between sections.
Main Question: What did the pandemic look like at its “peak” for each country?
There are many ways to define the “peak” of the pandemic. For the purposes of this analysis, the point in time in which each country reached its max number of COVID-19 cases per million people will be used. To provide more context, a visualization will be created that measures this variable against the number of COVID-19 deaths per million people.
To visualize the relationship between these variables, a specialized
scatter plot, known as a bubble chart, will be used. This visualization
allows for an extra dimension to be visualized determined by the size of
the bubble. Additionally, the bubbles will be colour-coded to represent
the continents. In summary, the following elements are of note:
-
The x-axis: COVID-19 cases per million people for a given country.
- The y-axis: COVID-19 deaths per million people for a given country.
- Bubble size: The population of a given country. Larger bubbles
represent larger populations. Population data has been sourced from John
Hopkins University, as outlined in the previous section.
- Bubble
colour: The continent that a given country belongs to. This will be
categorized using the countrycode package. There are some
conflicts that arise from the data, which are handled below.
#aggregate data based on totals
global_total <- global %>%
group_by(Country_Region, date) %>%
summarize(cases = sum(cases), deaths = sum(deaths),
Population = sum(Population)) %>%
select(Country_Region, date, cases, deaths,
Population) %>%
ungroup()
#add continent to global
global_total$Continent <- countrycode(
sourcevar = global_total$Country_Region,
origin = "country.name",
destination = "continent"
)
## Warning: Some values were not matched unambiguously: Diamond Princess, Kosovo, Micronesia, MS Zaandam, Summer Olympics 2020, Winter Olympics 2022
summary(global_total)
## Country_Region date cases deaths
## Length:229743 Min. :2020-01-22 Min. : 0 Min. : 0
## Class :character 1st Qu.:2020-11-02 1st Qu.: 3831 1st Qu.: 46
## Mode :character Median :2021-08-15 Median : 52933 Median : 786
## Mean :2021-08-15 Mean : 1379412 Mean : 19238
## 3rd Qu.:2022-05-28 3rd Qu.: 499592 3rd Qu.: 7227
## Max. :2023-03-09 Max. :103802702 Max. :1123836
##
## Population Continent
## Min. :8.090e+02 Length:229743
## 1st Qu.:1.886e+06 Class :character
## Median :8.696e+06 Mode :character
## Mean :3.246e+07
## 3rd Qu.:2.769e+07
## Max. :1.380e+09
## NA's :8001
There are a number of missing values that are present in the data, particularly in the Population column. The output from the continent classification indicates that further investigation into these values is also required.
#investigate missing values and continent conflicts
non_country <- unique(global_total$Country_Region[which(is.na(global_total$Population))])
non_country
## [1] "Antarctica" "Canada" "China"
## [4] "Diamond Princess" "MS Zaandam" "Summer Olympics 2020"
## [7] "Winter Olympics 2022"
non_con <- unique(global_total$Country_Region[which(is.na(global_total$Continent))])
non_con
## [1] "Diamond Princess" "Kosovo" "Micronesia"
## [4] "MS Zaandam" "Summer Olympics 2020" "Winter Olympics 2022"
Investigating the missing values reveals that some non-countries (and countries) have been included in the data, which is why they do not have corresponding population values. These values include cruise ships and Olympic events. This is valuable data, but for the purposes of this analysis, they will not be considered as countries. There seems to be some sort of error in displaying the populations of Canada and China most likely due to the way their provinces have been recorded. These values will be added manually.
There are also some countries that have not been classified by continent. Since they have been included in the COVID-19 data, they will be considered countries for the analysis. Kosovo will be considered a European country, and Micronesia will be considered an Oceanic country.
Finally, once the missing values have been resolved, the number of COVID-19 cases and deaths per million people per country can be added without errors arising.
#remove countries that aren't actually countries.
non_country <- non_country[!non_country %in% c("China", "Canada")]
global_total <- global_total %>%
filter(!Country_Region %in% non_country)
sum(is.na(global_total$Population))
## [1] 2286
#update population values for China and Canada
pop_china <- uid$Population[uid$Country_Region == "China" &
is.na(uid$Province_State)]
pop_canada <- uid$Population[uid$Country_Region == "Canada" &
is.na(uid$Province_State)]
global_total <- global_total %>%
mutate(Population = case_when(
Country_Region == "China" & is.na(Population) ~ pop_china,
Country_Region == "Canada" & is.na(Population) ~ pop_canada,
TRUE ~ Population
))
#ensure that the number of countries seems reasonable
length(unique(global_total$Country_Region))
## [1] 196
#check population values are reasonable
min(global_total$Population)
## [1] 809
max(global_total$Population)
## [1] 1411778724
#Add in Kosovo and Micronesia's continents
kos <- global_total$Country_Region == "Kosovo"
global_total$Continent[kos] <- "Europe"
micro <- global_total$Country_Region == "Micronesia"
global_total$Continent[micro] <- "Oceania"
unique(global_total$Continent)
## [1] "Asia" "Europe" "Africa" "Americas" "Oceania"
sum(is.na(global_total))
## [1] 0
#add cases and deaths per thousand people
global_total <- global_total %>%
mutate(deaths_per_mill = deaths *1000000/ Population) %>%
mutate(cases_per_mill = cases *1000000/ Population)
Now that the data has been transformed and the missing values have been addressed, the visualization can be created. Since a bubble chart of this scale has a great deal of information, an interactive graph is useful for users to further investigate the data. An interactive plot will be created using the plotly package.
#peak of the pandemic values
global_max_cases <- global_total %>%
group_by(Country_Region) %>%
slice(which.max(cases))
#capture American cases as a benchmark.
amer_cases <- global_max_cases %>%
filter(Country_Region == "US")
#create visualization
global_max_cases %>%
plot_ly(
type = "scatter",
mode = "markers",
#round values
x = ~round(cases_per_mill, digits = 2),
y = ~round(deaths_per_mill, digits = 2),
size = ~Population,
color = ~Continent,
#alter size of markers slightly for clarity
marker = list(sizeref = 0.4, sizemode = "area"),
text = ~Country_Region
) %>%
layout(
title = "Peak Global COVID-19 Cases and Deaths per Million",
xaxis = list(title = "Cases per Million"),
yaxis = list(title = "Deaths per Million"),
#add annotation to highlight the US.
annotations = list(
x = amer_cases$cases_per_mill,
y = amer_cases$deaths_per_mill,
text = amer_cases$Country_Region,
xref = "x",
yref = "y",
showarrow = FALSE
)
)
This provides a snapshot of what each country’s “peak” looks like. However, this raises the question of how trends are measured over time. The “peak” does not reflect the nuances in case or death trajectories. These might occur due to considerations such as adopting new policies or rolling out vaccines. The data point for the United States has been labelled to serve as a benchmark.
Visualizing trends over time can aid in understanding how COVID-19 has affected countries over time. To do this, an animated plotly visualization can be created.
An important caveat is that the population for a given country will change over time. This is further complicated by reporting measures during the pandemic. For the purpose of this analysis, the populations used will remain static.
First, the data will be transformed to capture the cases and deaths per million per month for each country. In this case, the “per month” value will be obtained by taking the number of cases and deaths for the last day of each month. Since the data is cumulative, this will, by default, represent the number of cases at the end of a given month. Since the data only extends to 2023-03-09, this day will represent the last day of March 2023.
#find the range of the dates
range(global_total$date)
## [1] "2020-01-22" "2023-03-09"
#create a vector consisting of the last day of every month
month_dates <- seq(as.Date("2020-02-01"), length = 39, by = "1 month")-1
#replace value for March - data only goes up to March 9
month_dates <- replace(month_dates,month_dates == "2023-03-31", as.Date("2023-03-09"))
#filter global data set by last day of every month.
global_monthly_totals <- global_total %>%
filter(date %in% month_dates)
#the "animation" frame works by transforming dates to factors.
global_monthly_totals$date <- as.factor(global_monthly_totals$date)
Note: To see the reported cases and deaths per million each month by country, click the “play” button below. The slider can be used to navigate to a specific month. The date is displayed above the slider.
global_monthly_totals %>%
plot_ly(
type = "scatter",
mode = "markers",
x = ~round(cases_per_mill, digits = 2),
y = ~round(deaths_per_mill, digits = 2),
size = ~Population,
color = ~Continent,
marker = list(sizeref = 0.4, sizemode = "area"),
#animate using the frame parameter
frame = ~date,
text = ~Country_Region
) %>%
layout(
title = "Global COVID-19 Cases and Deaths per Million Over Time",
xaxis = list(title = "Cases per Million"),
yaxis = list(title = "Deaths per Million")
) %>%
#alter the speed slightly for visual clarity
animation_opts(
750,
redraw = FALSE
) %>%
#rename the prefix that appears before the date
animation_slider(
currentvalue = list(prefix = "Date: ")
)
amer_cases
## # A tibble: 1 × 8
## # Groups: Country_Region [1]
## Country_Region date cases deaths Population Continent deaths_per_mill
## <chr> <date> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 US 2023-03-09 1.04e8 1.12e6 329466283 Americas 3411.
## # … with 1 more variable: cases_per_mill <dbl>
global_max_cases[which.max(global_max_cases$cases_per_mill),]
## # A tibble: 1 × 8
## # Groups: Country_Region [1]
## Country_Region date cases deaths Population Continent deaths_per_mill
## <chr> <date> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 San Marino 2023-03-07 23616 122 33938 Europe 3595.
## # … with 1 more variable: cases_per_mill <dbl>
global_max_cases[which.max(global_max_cases$deaths_per_mill),]
## # A tibble: 1 × 8
## # Groups: Country_Region [1]
## Country_Region date cases deaths Population Continent deaths_per_mill
## <chr> <date> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Peru 2023-03-08 4487553 219539 32971846 Americas 6658.
## # … with 1 more variable: cases_per_mill <dbl>
mean(global_max_cases$cases_per_mill)
## [1] 169917
The country with the most COVID-19 cases per million at its peak is San Marino. San Marino has a population of approximately 34,000 and had close to 24,000 cases at its peak.The country with the most deaths per million is Peru. On average, there were 169,917 cases per million people across all countries when considering the peak for each country. America, at its peak, had 315,063 cases per million people.
When considering trends by continental region, European countries have varying cases per million, but middling deaths per million. Countries in the Americas and Africa are generally on the lower end of cases and deaths per million. A strong trend does not persist for Asian countries. Oceania has varying cases per million, but low deaths per million.
The temporal changes in cases and deaths per million can also be analyzed by observing the changes in the number of cases per month. The ratio between cases and deaths per million remains fairly consistent on a country-by-country level until the end of 2021. There is a noticeable spike in the number of cases per million at this time, but not a large shift in the number of deaths per million. A number of factors could contribute toward this observed trend, such as vaccination rates, policy changes, and COVID-19 variants.
Future directions for analysis include some of the epidemiological considerations for the spread of viruses. The incorporation of variables regarding vaccines, policies regarding practices for COVID-19, awareness campaigns, and hospitalization rates serve as interesting future directions for better understanding the trends observed in the spread of COVID-19.
Main Question: How did the United States fare in terms of cases and deaths at the state level?
To explore the effects of COVID-19 in the US, a choropleth map can be created. The ratio between cases and deaths, or fatalities, can serve as a measure of the performance of each individual state.
To investigate further, the data must be tidied further to handle missing values and aggregate cases at the state level.
#subset values that are missing
US_missing <- US[which(is.na(US$Admin2)),]
US_missing_PS <- unique(US_missing$Province_State)
US_missing_PS
## [1] "American Samoa" "Diamond Princess"
## [3] "Grand Princess" "Guam"
## [5] "Northern Mariana Islands" "Virgin Islands"
#filter data
US <- US %>%
filter(!Province_State %in% US_missing_PS)
sum(is.na(US))
## [1] 0
Some of the missing values stem from the inclusion of US territories. While this is valuable data, the analysis is focused on the effect of the pandemic on the states of the US. As such, these values can be safely removed without having an impact on the analysis.
The last available day of the data will be used to determine the case to fatality ratio, as the cumulative data will represent the totality of these variables.
#cases/deaths by state
US_by_state <- US %>%
#group data by state, region
group_by(Province_State, Country_Region, date) %>%
#for each state, take the cases to be the sum of cases in the state
summarize(cases = sum(cases), deaths = sum(deaths),
Population = sum(Population)) %>%
#filter only last day in data frame
filter(date == as.Date("2023-03-09")) %>%
#add case:fatality ratio
mutate(cases_to_deaths = cases/deaths)
#check which states aren't supposed to be part of the 50
#state.name is a built-in vector.
not_states <- US_by_state[!US_by_state$Province_State %in% state.name ,]
not_states
## # A tibble: 2 × 7
## # Groups: Province_State, Country_Region [2]
## Province_State Country_Region date cases deaths Population
## <chr> <chr> <date> <dbl> <dbl> <dbl>
## 1 District of Columbia US 2023-03-09 177945 1432 705749
## 2 Puerto Rico US 2023-03-09 1101469 5823 3754939
## # … with 1 more variable: cases_to_deaths <dbl>
US_by_state <- US_by_state %>%
filter(!Province_State %in% not_states$Province_State)
#add state abbreviations for graphing purposes.
US_by_state$State_Abbr <- state.abb
To complete the visualization, the data included must only be the 50 US states. Puerto Rico and the District of Columbia have been included in the COVID-19 data at the state level. For the purpose of this analysis, they will be removed to maintain the focus on states. Abbreviations for each state have been including using the built-in R state data. These are required for the plot_geo function.
US_by_state %>%
#indicate that US states are being used
plot_geo(
locationmode = 'USA-states'
) %>%
add_trace(
z= ~cases_to_deaths,
locations = ~State_Abbr,
color= ~cases_to_deaths,
hoverinfo = 'text',
#alter the text that appears on hover
text = ~paste(
'</br> State: ', Province_State, ' (',State_Abbr ,')',
'</br> Ratio: ', round(cases_to_deaths, digits = 2),
'</br> Cases: ', round(cases, digits = 2),
'</br> Deaths: ', round(deaths, digits = 2)
)
) %>%
layout(
#zoom into graph so that only the US is seen.
geo = list(scope = 'usa'),
title = "COVID-19 Case to Fatality Ratio for US States"
) %>%
#change legend title - slightly different procedure for continuous scale legends.
colorbar(title = "Case to Fatality Ratio")
US_by_state %>%
#"ungroup" here for the following operations.
ungroup() %>%
arrange(desc(cases_to_deaths)) %>%
slice(1, length(US_by_state$cases_to_deaths))
## # A tibble: 2 × 8
## Province_State Country_Region date cases deaths Population
## <chr> <chr> <date> <dbl> <dbl> <dbl>
## 1 Alaska US 2023-03-09 307655 1486 740995
## 2 Pennsylvania US 2023-03-09 3527854 50398 12801989
## # … with 2 more variables: cases_to_deaths <dbl>, State_Abbr <chr>
mean(US_by_state$cases_to_deaths)
## [1] 101.2346
The state with the lowest case to death ratio is Pennsylvania. The highest ratio is Alaska. On average, the case to fatality ratio was 101 cases:deaths. While the ratio accurately represents the relationship between cases and deaths, it does not reflect the population sizes of each state.
This ratio does provide some insight into COVID-19 activity at the state level, but it may not capture the nuances of COVID-19 transmission. Future directions may include the integration of vaccination rates and population sizes. Additionally, further analysis of states grouped by their respective regions would provide higher-order trends.
Main question: Is there a relationship between the number of cases per million and deaths per million for countries in the Americas?
To answer this question, a simple linear model will be created to determine if it is an appropriate way to describe the relationship between cases and deaths per million.
First, the “peak” COVID-19 data will be repurposed. It will be filtered to only contain values for countries in the Americas. Please note that the “peak” here represents the day that the cases reached a maximum point for individual countries.
#filter max cases data
americas_max_cases <- global_max_cases %>%
filter(Continent == "Americas")
#preliminary missing checks
sum(is.na(americas_max_cases))
## [1] 0
#linear model
mod <- lm(deaths_per_mill ~ cases_per_mill, data = americas_max_cases)
summary(mod)
##
## Call:
## lm(formula = deaths_per_mill ~ cases_per_mill, data = americas_max_cases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1398.2 -727.6 -193.9 462.6 4751.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.054e+03 3.473e+02 3.034 0.00468 **
## cases_per_mill 6.269e-03 2.141e-03 2.928 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1135 on 33 degrees of freedom
## Multiple R-squared: 0.2062, Adjusted R-squared: 0.1822
## F-statistic: 8.574 on 1 and 33 DF, p-value: 0.006138
#use linear model to predict deaths based on linear model.
americas_max_cases_pred <- americas_max_cases %>%
ungroup() %>%
mutate(pred = predict(mod))
#to see how good our model is, we can plot predictions against actual values.
americas_max_cases_pred %>%
plot_ly(
type = "scatter",
mode = "markers",
x = ~cases_per_mill,
y = ~deaths_per_mill,
text = ~Country_Region,
name = "Actual"
) %>%
add_trace(
x = ~cases_per_mill,
y = ~pred,
text = ~Country_Region,
name = "Predicted"
) %>%
layout(
title = "Predicted and Actual COVID-19 Cases and Deaths per Million - The Americas",
xaxis = list(title = "Cases per Million"),
yaxis = list(title = "Deaths per Million")
)
There is a clear outlier present in the data. Peru has an extremely high deaths per million value in comparison to its cases per million. As an exercise, Peru’s values will be removed to observe how its effect on the analysis.
americas_max_cases_outlier <- americas_max_cases %>%
filter(!Country_Region == "Peru")
mod1 <- lm(deaths_per_mill ~ cases_per_mill, data = americas_max_cases_outlier)
summary(mod1)
##
## Call:
## lm(formula = deaths_per_mill ~ cases_per_mill, data = americas_max_cases_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1257.13 -586.02 -73.88 593.87 1347.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.160e+02 2.386e+02 3.839 0.000549 ***
## cases_per_mill 6.254e-03 1.464e-03 4.271 0.000163 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 776.5 on 32 degrees of freedom
## Multiple R-squared: 0.363, Adjusted R-squared: 0.3431
## F-statistic: 18.24 on 1 and 32 DF, p-value: 0.0001628
americas_max_cases_outlier_pred <- americas_max_cases_outlier %>%
ungroup() %>%
mutate(pred = predict(mod1))
americas_max_cases_outlier_pred %>%
plot_ly(
type = "scatter",
mode = "markers",
x = ~cases_per_mill,
y = ~deaths_per_mill,
text = ~Country_Region,
name = "Actual"
) %>%
add_trace(
x = ~cases_per_mill,
y = ~pred,
text = ~Country_Region,
name = "Predicted"
) %>%
layout(
title = "Predicted and Actual COVID-19 Cases and Deaths per Million - The Americas",
xaxis = list(title = "Cases per Million"),
yaxis = list(title = "Deaths per Million")
)
#min/max values
americas_max_cases_pred %>%
ungroup() %>%
#arrange in ascending order - least to most dpm
arrange((deaths_per_mill)) %>%
slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
## Country_Region cases_per_mill deaths_per_mill pred date cases deaths
## <chr> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
## 1 Nicaragua 2363. 37.0 1068. 2023-03-08 1.57e4 2.45e2
## 2 Haiti 3000. 75.4 1072. 2023-03-04 3.42e4 8.6 e2
## 3 Venezuela 19418. 206. 1175. 2023-03-09 5.52e5 5.85e3
## 4 Dominican Repu… 60914. 404. 1436. 2023-03-08 6.61e5 4.38e3
## 5 El Salvador 31110. 651. 1249. 2022-08-30 2.02e5 4.22e3
## 6 Trinidad and T… 135705. 3112. 1904. 2023-03-07 1.90e5 4.36e3
## 7 Brazil 174451. 3290. 2147. 2023-03-03 3.71e7 6.99e5
## 8 Chile 271617. 3362. 2756. 2023-03-09 5.19e6 6.43e4
## 9 US 315063. 3411. 3029. 2023-03-09 1.04e8 1.12e6
## 10 Peru 136103. 6658. 1907. 2023-03-08 4.49e6 2.20e5
## # … with 2 more variables: Population <dbl>, Continent <chr>
americas_max_cases_pred %>%
ungroup() %>%
#arrange in ascending order - least to most dpm
arrange((pred)) %>%
slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
## Country_Region cases_per_mill deaths_per_mill pred date cases deaths
## <chr> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
## 1 Nicaragua 2363. 37.0 1068. 2023-03-08 1.57e4 2.45e2
## 2 Haiti 3000. 75.4 1072. 2023-03-04 3.42e4 8.6 e2
## 3 Venezuela 19418. 206. 1175. 2023-03-09 5.52e5 5.85e3
## 4 El Salvador 31110. 651. 1249. 2022-08-30 2.02e5 4.22e3
## 5 Honduras 47680. 1122. 1353. 2023-03-06 4.72e5 1.11e4
## 6 Panama 239116. 1995. 2553. 2023-03-06 1.03e6 8.61e3
## 7 Chile 271617. 3362. 2756. 2023-03-09 5.19e6 6.43e4
## 8 Uruguay 297750. 2193. 2920. 2023-02-27 1.03e6 7.62e3
## 9 US 315063. 3411. 3029. 2023-03-09 1.04e8 1.12e6
## 10 Barbados 371638. 2015. 3384. 2023-03-09 1.07e5 5.79e2
## # … with 2 more variables: Population <dbl>, Continent <chr>
americas_max_cases_outlier_pred %>%
ungroup() %>%
arrange((deaths_per_mill)) %>%
slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
## Country_Region cases_per_mill deaths_per_mill pred date cases deaths
## <chr> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
## 1 Nicaragua 2363. 37.0 931. 2023-03-08 1.57e4 2.45e2
## 2 Haiti 3000. 75.4 935. 2023-03-04 3.42e4 8.6 e2
## 3 Venezuela 19418. 206. 1037. 2023-03-09 5.52e5 5.85e3
## 4 Dominican Repu… 60914. 404. 1297. 2023-03-08 6.61e5 4.38e3
## 5 El Salvador 31110. 651. 1111. 2022-08-30 2.02e5 4.22e3
## 6 Argentina 222254. 2887. 2306. 2023-03-06 1.00e7 1.30e5
## 7 Trinidad and T… 135705. 3112. 1765. 2023-03-07 1.90e5 4.36e3
## 8 Brazil 174451. 3290. 2007. 2023-03-03 3.71e7 6.99e5
## 9 Chile 271617. 3362. 2615. 2023-03-09 5.19e6 6.43e4
## 10 US 315063. 3411. 2886. 2023-03-09 1.04e8 1.12e6
## # … with 2 more variables: Population <dbl>, Continent <chr>
americas_max_cases_outlier_pred %>%
ungroup() %>%
arrange((pred)) %>%
slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
## Country_Region cases_per_mill deaths_per_mill pred date cases deaths
## <chr> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
## 1 Nicaragua 2363. 37.0 931. 2023-03-08 1.57e4 2.45e2
## 2 Haiti 3000. 75.4 935. 2023-03-04 3.42e4 8.6 e2
## 3 Venezuela 19418. 206. 1037. 2023-03-09 5.52e5 5.85e3
## 4 El Salvador 31110. 651. 1111. 2022-08-30 2.02e5 4.22e3
## 5 Honduras 47680. 1122. 1214. 2023-03-06 4.72e5 1.11e4
## 6 Panama 239116. 1995. 2411. 2023-03-06 1.03e6 8.61e3
## 7 Chile 271617. 3362. 2615. 2023-03-09 5.19e6 6.43e4
## 8 Uruguay 297750. 2193. 2778. 2023-02-27 1.03e6 7.62e3
## 9 US 315063. 3411. 2886. 2023-03-09 1.04e8 1.12e6
## 10 Barbados 371638. 2015. 3240. 2023-03-09 1.07e5 5.79e2
## # … with 2 more variables: Population <dbl>, Continent <chr>
Base model: In the base model, the model has an intercept value of 1054. The slope value is 0.006269; for every additional case per million, it is expect that there is an additional 0.006269 deaths per million. The model has an adjusted R-squared value of 0.1822 and a residual standard error of 1135 with 33 degrees of freedom. In this version, Nicaragua has the lowest actual and predicted deaths per million. Peru has the highest actual deaths per million, whereas Barbados has the highest predicted deaths per million.
Alternate model - outlier removed : In this version of the model, the model has an intercept value of 916. The slope value is 0.006254 for every additional case per million, it is expected that there is an additional 0.006254 deaths per million. The model has an adjusted R-squared value of 0.3431 and a residual standard error of 776.5 with 32 degrees of freedom. In this version, Nicaragua has the lowest actual and predicted deaths per million. The United States has the highest actual deaths per million, and Barbados has the highest predicted deaths per million.
When outliers are removed, there is a moderate linear correlation between the cases and deaths per million in the Americas. This analysis is a preliminary one to demonstrate the correlation between just two variables. Future directions can expand and potentially improve on this model. Some future directions that this model may take include vaccination rates, temporal effects, and input related to variants. Different versions of this model may also expand the data to include other countries, or subdivide the data into “North” and “South” America.
The visualizations and analysis presented in this report provide insight into the trends and patterns regarding the COVID-19 pandemic. However, it is important to examine potential sources of bias when drawing conclusion about the results.
One potential source of bias lies in the reporting measures. Global data contains confirmed cases and deaths as a result of COVID-19. If the way in which data is captured and recorded relating this data is not consistent across individual countries, this may be reflected as inaccuracies or inconsistencies in the analysis.
On a systemic level, interpretation of the data needs to be handled with care. It is important to be mindful of how access to hospital equipment, vaccines, and personal protective equipment may be reflected in the number of cases and deaths. Furthermore, the cultural conception of public health may result in differing approaches to combating COVID-19, which may also be reflected in observed trends. Another systemic consideration that may have an affect on the analysis is the racial and ethnic disparities in COVID-19 infection, hospitalization, and transmission that may persist due to a number of socioeconomic factors.
It is possible that personal biases may have influenced the data analysis. Topics of contention that have the potential to influence the analysis include a Western conception of medicine, health, and treatment; full support of vaccinations being mandatory; a scientific background in biology and public health; and the benefit of hindsight when drawing conclusions about the data after the pandemic has ended. To mitigate these personal biases, the analysis presented is focused on objective results and information. The conclusions drawn pertain only to what can be ascertained statistically. While certain forms of biases cannot be fully prevented, they have been recorded to promote transparency in data analysis.
The report presented has produced results related to the temporal and geospatial aspects of the COVID-19 pandemic. There are a number of future directions that can be taken to expand or enhance the results presented. Furthering the understanding of a global pandemic will help promote an understanding in preventing and controlling future global pandemics.